This R notebook is a walkthrough of a tutorial of the Seurat library in R. It will use 2700 Peripheral Blood Mononuclear Cells (PBMC). This tutorial’s contents have been modified for the McGill MICM workshop in dimension reduction. The full tutorial can be found at the Satija lab website. They also have a GitHub Wiki with more technical discussion.
The Seurat library was developed to smooth the process of carrying out cell-sequencing analysis. In this workshop, it is meant to illustrate what a single-cell workflow can look like and how dimension reduction fits into it. Unlike the Python tutorial, we will be working with a Seurat object which contains both our data and our analyses. We will be working with matrices of cells (columns) and genes (rows). The matrix contents are counts of unique molecular identifiers (UMIs, the count of each gene).
While this walkthrough is specific to single-cell analysis using the Seurat package, its principles can be applied to any data. We use this specific tutorial because single-cell analysis is a hot topic these days and because the library is already prepared to do a lot of the more tedious heavy lifting for us, which could otherwise take a chunk of time to explain.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "data/filtered_gene_bc_matrices/hg19")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
# What do our data actually look like?
pbmc.data[c("CD3D","TCL1A","MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
## [[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
The . values in the matrix represent 0s (no molecules detected). Most data here is sparse, which means the majority of entries are 0.
# Seurat stores objects in a special sparse format to save memory.
dense.size <- object.size(as.matrix(pbmc.data))
cat("Dense size: ")
## Dense size:
cat(utils:::format.object_size(dense.size, "Mb"))
## 676.7 Mb
cat("\n")
sparse.size <- object.size(pbmc.data)
cat("Sparse size: ")
## Sparse size:
cat(utils:::format.object_size(sparse.size, "Mb"))
## 28.5 Mb
cat("\n")
cat("Ratio of sizes: ", dense.size / sparse.size)
## Ratio of sizes: 23.72804
Now that our data has been read into memory, we can begin pre-processing.
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features. ## QC and selecting cells for further analysis Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include * The number of unique genes detected in each cell. + Low-quality cells or empty droplets will often have very few genes + Cell doublets or multiplets may exhibit an aberrantly high gene count * Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes) * The percentage of reads that map to the mitochondrial genome + Low-quality / dying cells often exhibit extensive mitochondrial contamination + We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features + We use the set of all genes starting with MT- as a set of mitochondrial genes
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
In the example below, we visualize QC metrics, and use these to filter cells. * We filter cells that have unique feature counts over 2,500 or less than 200 * We filter cells that have >5% mitochondrial counts
#Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
An important step in analysis is normalization of the data to make sure our data are all on the same scale. In this example, the tutorial uses a “LogNormalize” function.
# Normalizes by total expression, multiplies by a scale (10,000 default), and applies a log-transformation
# Normalized values stored in pbmc[["RNA"]]@data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 1e4)
Typically in single-cell analysis we’re interested in seeing if (for example) a particular gene is especially highly expressed in some cell type. Focusing on these cells/genes is a productive approach (see this provided referece). Seurat has been especially good at integrating multiple types of single-cell data (see Stuart et al (2019) and Hao et al (2020) for further details.)
By default, this method (FindVariableFeatures) returns 2000 features, which we can use for further dimension reduction (PCA and UMAP here) and analysis.
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
Here we use a linear transformation to make sure the expression levels of every cell have mean 0 and standard deviation 1. Since PCA and UMAP can be biased by outliers, this makes sure that genes with very high expression levels don’t have overwhelming influence over the results.
# results are stored in `pbmc[["RNA"]]@scale.data`
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix
Scaling is an essential step in the Seurat workflow, but only on genes that will be used as input to PCA. Therefore, the default in ScaleData() is only to perform scaling on the previously identified variable features (2,000 by default). To do this, omit the features argument in the previous function call, i.e.
pbmc <- ScaleData(pbmc)
Your PCA and clustering results will be unaffected. However, Seurat heatmaps (produced as shown below with DoHeatmap()) require genes in the heatmap to be scaled, to make sure highly-expressed genes don’t dominate the heatmap. To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.
In Seurat v2 we also use the ScaleData() function to remove unwanted sources of variation from a single-cell dataset. For example, we could ‘regress out’ heterogeneity associated with (for example) cell cycle stage, or mitochondrial contamination. These features are still supported in ScaleData() in Seurat v3, i.e.:
pbmc <- ScaleData(pbmc, vars.to.regress = 'percent.mt')
However, particularly for advanced users who would like to use this functionality, we strongly recommend the use of our new normalization workflow, SCTransform(). The method is described in our paper, with a separate vignette using Seurat v3 here. As with ScaleData(), the function SCTransform() also includes a vars.to.regress parameter.
***
Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset.
Note that this is Seurat’s PCA command. In base R, you can run PCA using the prcomp or princomp.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
## FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
## PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
## Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
## CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
## MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
## HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
## BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
## Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
## CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
## TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
## HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
## PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
## Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
## HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
## NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
## SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
## GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
## AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
## LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
## GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
## RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
## Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
## PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
## FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
You can access the principal component vectors from the Seurat object using pbmc[['pca']]
head(pbmc[['pca']][,1:5]) # Print the top 5 PCs for the first few genes
## PC_1 PC_2 PC_3 PC_4 PC_5
## PPBP 0.010990202 0.01148426 -0.15176092 0.10403737 0.003299077
## LYZ 0.116231706 0.01472515 -0.01280613 -0.04414540 0.049906881
## S100A9 0.115414362 0.01895146 -0.02368853 -0.05787777 0.085382309
## IGLL5 -0.007987473 0.05454239 0.04901533 0.06694722 0.004603231
## GNLY -0.015238762 -0.13375626 0.04101340 0.06912322 0.104558611
## FTL 0.118292572 0.01871142 -0.00984755 -0.01555269 0.038743505
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap()
# Examine and visualize PCA results a few different ways
print(pbmc[['pca']], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = 'pca')
DimPlot(pbmc, reduction = 'pca')
In particular
DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
# Determine the ‘dimensionality’ of the dataset
There are many arguments about finding the “dimensionality” of the dataset. One common question is: how many PCs are “enough” for our purposes? There are multiple methods, such as scree plots, to pick out how much variation you want to explain (recall that PCs explain variance from most-to-least). I usually recommend running multiple sets of top PCs (5, 10, 15, 20, etc). There can also be subject-matter specific decisions, as some cells might not form clusters unless (for example) the top 15 PCs are used.
ElbowPlot(pbmc) # Scree plot (look for the "elbow")
This is a short explanation of clustering from the tutorial:
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents() function.
pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 96033
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 1 3 1 2
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
Seurat has its own functions to run UMAP and t-SNE. There are also many R libraries with implementations of t-SNE (e.g. Rtsne) and UMAP (e.g. umap and uwot). Generally PCs are used as input to the methods for computational reasons.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages = "umap-learn")
pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:32:46 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:32:46 Read 2638 rows and found 10 numeric columns
## 18:32:46 Using Annoy for neighbor search, n_neighbors = 30
## 18:32:46 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:32:47 Writing NN index file to temp file /var/folders/bc/5j5j70q94n35rr3xtnqy_lv40000gn/T//RtmplY16Uq/filedd3f4b38e17c
## 18:32:47 Searching Annoy index using 1 thread, search_k = 3000
## 18:32:48 Annoy recall = 100%
## 18:32:48 Commencing smooth kNN distance calibration using 1 thread
## 18:32:49 Initializing from normalized Laplacian + noise
## 18:32:49 Commencing optimization for 500 epochs, with 105132 positive edges
## 18:32:53 Optimization finished
# note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = 'umap')
These can take some time to run with larger datasets, so it’s recommended to use saveRDS and readRDS to store and recover your work. (Note: this works for R objects in general!)
#saveRDS(pbmc, file = "../output/pbmc_tutorial.rds") # not used in walkthrough
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells. The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a feature to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time - since this will test a large number of features that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed features will likely still rise to the top.
# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## S100A9 0.000000e+00 3.860873 0.996 0.215 0.000000e+00
## S100A8 0.000000e+00 3.796640 0.975 0.121 0.000000e+00
## LGALS2 0.000000e+00 2.634294 0.908 0.059 0.000000e+00
## FCN1 0.000000e+00 2.352693 0.952 0.151 0.000000e+00
## CD14 2.856582e-294 1.951644 0.667 0.028 3.917516e-290
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## FCGR3A 7.583625e-209 2.963144 0.975 0.037 1.040018e-204
## IFITM3 2.500844e-199 2.698187 0.975 0.046 3.429657e-195
## CFD 1.763722e-195 2.362381 0.938 0.037 2.418768e-191
## CD68 4.612171e-192 2.087366 0.926 0.036 6.325132e-188
## RP11-290F20.3 1.846215e-188 1.886288 0.840 0.016 2.531900e-184
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## # A tibble: 18 x 7
## # Groups: cluster [9]
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.96e-107 0.730 0.901 0.594 2.69e-103 0 LDHB
## 2 1.61e- 82 0.922 0.436 0.11 2.20e- 78 0 CCR7
## 3 7.95e- 89 0.892 0.981 0.642 1.09e- 84 1 LTB
## 4 1.85e- 60 0.859 0.422 0.11 2.54e- 56 1 AQP3
## 5 0. 3.86 0.996 0.215 0. 2 S100A9
## 6 0. 3.80 0.975 0.121 0. 2 S100A8
## 7 0. 2.99 0.936 0.041 0. 3 CD79A
## 8 9.48e-271 2.49 0.622 0.022 1.30e-266 3 TCL1A
## 9 2.96e-189 2.12 0.985 0.24 4.06e-185 4 CCL5
## 10 2.57e-158 2.05 0.587 0.059 3.52e-154 4 GZMK
## 11 3.51e-184 2.30 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 2.14 1 0.315 2.78e-121 5 LST1
## 13 7.95e-269 3.35 0.961 0.068 1.09e-264 6 GZMB
## 14 3.13e-191 3.69 0.961 0.131 4.30e-187 6 GNLY
## 15 1.48e-220 2.68 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 1.99 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 7.73e-200 5.02 1 0.01 1.06e-195 8 PF4
## 18 3.68e-110 5.94 1 0.024 5.05e-106 8 PPBP
Seurat has several tests for differential expression which can be set with the test.use parameter (see our DE vignette for details). For example, the ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
The package includes several tools for visualizing marker expression. VlnPlot() (shows expression probability distributions across clusters), and FeaturePlot() (visualizes feature expression on a tSNE or PCA plot) are the most commonly used visualizations. They also suggest exploring RidgePlot(), CellScatter(), and DotPlot() as additional methods to view your dataset. Outside of the library, you may use the standard plot function or graphics libraries such as ggplot2 and plotly.
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = 'counts', log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
DoHeatmap() generates an expression heatmap for given cells and features. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
*** # Assigning cell type identity to clusters Fortunately in the case of this dataset, we can use canonical markers to easily match the unbiased clustering to known cell types: Cluster ID | Markers | Cell Type ———–|—————|———- 0 | IL7R, CCR7 | Naive CD4+ T 1 | CD14, LYZ | CD14+ Mono 2 | IL7R, S100A4 | Memory CD4+ 3 | MS4A1 | B 4 | CD8A | CD8+ T 5 | FCGR3A, MS4A7 | FCGR3A+ Mono 6 | GNLY, NKG7 | NK 7 | FCER1A, CST3 | DC 8 | PPBP | Platelet
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = 'umap', label = TRUE, pt.size = 0.5) + NoLegend()
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.0.0 Seurat_3.1.5 dplyr_0.8.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 tsne_0.1-3 bitops_1.0-6
## [4] RcppAnnoy_0.0.12 RColorBrewer_1.1-2 httr_1.4.2
## [7] sctransform_0.2.0 tools_3.5.3 utf8_1.1.4
## [10] R6_2.4.1 irlba_2.3.3 KernSmooth_2.23-15
## [13] uwot_0.1.8 lazyeval_0.2.2 colorspace_1.4-1
## [16] npsurv_0.4-0 withr_2.4.0 tidyselect_0.2.5
## [19] gridExtra_2.3 compiler_3.5.3 cli_2.2.0
## [22] plotly_4.9.0 labeling_0.3 caTools_1.17.1.2
## [25] scales_1.1.0 lmtest_0.9-37 ggridges_0.5.1
## [28] pbapply_1.4-2 stringr_1.4.0 digest_0.6.22
## [31] rmarkdown_2.5 pkgconfig_2.0.3 htmltools_0.3.6
## [34] limma_3.38.3 htmlwidgets_1.3 rlang_0.4.10
## [37] farver_2.0.1 zoo_1.8-5 jsonlite_1.7.2
## [40] ica_1.0-2 gtools_3.8.2 magrittr_1.5
## [43] Matrix_1.2-15 fansi_0.4.0 Rcpp_1.0.4
## [46] munsell_0.5.0 ape_5.3 reticulate_1.12
## [49] lifecycle_0.1.0 stringi_1.4.3 MASS_7.3-51.1
## [52] gplots_3.0.3 Rtsne_0.15 plyr_1.8.4
## [55] grid_3.5.3 parallel_3.5.3 gdata_2.18.0
## [58] listenv_0.7.0 ggrepel_0.8.1 crayon_1.3.4
## [61] lattice_0.20-38 cowplot_1.0.0 splines_3.5.3
## [64] knitr_1.23 pillar_1.4.2 igraph_1.2.4.1
## [67] future.apply_1.3.0 reshape2_1.4.3 codetools_0.2-16
## [70] leiden_0.3.1 glue_1.3.1 evaluate_0.13
## [73] lsei_1.2-0 data.table_1.12.8 vctrs_0.2.3
## [76] png_0.1-7 gtable_0.3.0 RANN_2.6.1
## [79] purrr_0.3.3 tidyr_1.0.2 future_1.15.0
## [82] assertthat_0.2.1 ggplot2_3.2.1 xfun_0.19
## [85] rsvd_1.0.0 RSpectra_0.14-0 survival_2.43-3
## [88] viridisLite_0.3.0 tibble_2.1.3 cluster_2.0.7-1
## [91] globals_0.12.4 fitdistrplus_1.0-14 ROCR_1.0-7